Rare genetic variants in genes and loci linked to dominant monogenic developmental disorders cause milder related phenotypes in the general population

Summary Many rare monogenic diseases are known to be caused by deleterious variants in thousands of genes, however the same variants can also be found in people without the associated clinical phenotypes. The penetrance of these monogenic variants is generally unknown in the wider population, as they are typically identified in small clinical cohorts of affected individuals and families with highly penetrant variants. Here, we investigated the phenotypic effect of rare, potentially deleterious variants in genes and loci where similar variants are known to cause monogenic developmental disorders (DDs) in a large population cohort. We used UK Biobank to investigate phenotypes associated with rare protein-truncating and missense variants in 599 monoallelic DDG2P genes by using whole-exome-sequencing data from ∼200,000 individuals and rare copy-number variants overlapping known DD loci by using SNP-array data from ∼500,000 individuals. We found that individuals with these likely deleterious variants had a mild DD-related phenotype, including lower fluid intelligence, slower reaction times, lower numeric memory scores, and longer pairs matching times compared to the rest of the UK Biobank cohort. They were also shorter, had a higher BMI, and had significant socioeconomic disadvantages: they were less likely to be employed or be able to work and had a lower income and higher deprivation index. Our findings suggest that many genes routinely tested within pediatric genetics have deleterious variants with intermediate penetrance that may cause lifelong sub-clinical phenotypes in the general adult population.


Introduction
Deleterious variants in thousands of genes have been shown to cause rare, monogenic diseases. 1 However, not all individuals with these variants share the same clinical phenotypes; some don't appear to be affected at all, whereas others are very severely affected. 2 Monogenic variants can display different effects in different individuals. 3 The range of phenotypes caused by deleterious variants in the same gene can be explained by pleiotropy, incomplete penetrance, and variable expressivity. 4 Penetrance (i.e., whether an individual with a disease-causing genotype displays the corresponding clinical phenotype) is generally binary; either a variant is penetrant and causes the clinical phenotype associated with that genotype or it is not. 2,5 In contrast, variable expressivity (i.e., the range of phenotypes that can be observed in affected individuals) is generally continuous, e.g., from mild to severe. 6 Although incomplete penetrance and variable expressivity are distinct concepts, in practice they can be hard to separate, especially when considering continuous phenotypes in populations.
As most monogenic disease-causing variants have been identified through small clinical cohorts, including families with multiple affected individuals, penetrance of these variants is often overestimated. Investigating the effect of these variants in the general population is therefore important to give a more accurate view of the penetrance in clinically unselected individuals and families. It has been suggested that many of the primary symptoms of rare disease are actually extremes of normally distributed phenotypes in the general population. 1,7 Large, well genotyped population cohorts give us the ability to investigate the spectrum of phenotypes of individuals with variants in genes known to cause monogenic disease. Phenotypic heterogeneity and variability are a major concern for rare Mendelian disorders, where they can lead to incorrect or delayed diagnoses. 8,9 Many severe developmental disorders (DDs) manifest from birth or early childhood and are caused by rare damaging variants in around 2,000 genes and loci. 10 Pathogenic variants in these genes have been identified primarily through phenotype-led clinical studies of affected individuals and families. 4 Due to extensive genetic and phenotypic heterogeneity, large multigene panels are increasingly being used for diagnostic testing, often through panel-based virtual analysis of whole-exome-sequence or whole-genome-sequence data. However, little is known about what effect, if any, deleterious variants in these genes have on adults in the general population or their lifelong implications. In this study, using genetic and phenotypic data from UK Biobank (UKB), 11 we investigated whether adults with rare deleterious variants in genes and loci linked to dominant monogenic DD have any developmentally relevant phenotypes.

UK biobank cohort
UKB is a population-based cohort from the UK with deep phenotyping data and genetic data for around 500,000 individuals aged 40-70 years at recruitment. Individuals provided a variety of information via self-report questionnaires, cognitive and anthropometric measurements, and Hospital Episode Statistics (HES) including ICD9 and ICD10 codes. Genotypes for single-nucleotide polymorphisms (SNPs) were generated with the Affymetrix Axiom UK Biobank array ($450,000 individuals) and the UK BiLEVE array ($50,000 individuals). This dataset underwent extensive central quality control (http://biobank.ctsu.ox.ac.uk). A subset of $200,000 individuals also underwent whole-exome sequencing (WES) with the IDT xGen Exome Research Panel v1.0; this dataset was made available for research in October 2020. Detailed sequencing and variant detection methodology for UKB is available at https://biobank.ctsu.ox.ac.uk/showcase/label.cgi?id¼170. The UKB resource was approved by the UK Biobank Research Ethics Committee and all participants provided written informed consent to participate.

Gene selection
We used the clinically curated Developmental Disorders Gene2Phenotype Database (DDG2P; https://www.ebi.ac.uk/ gene2phenotype/) to select genes where rare variants are known to cause monogenic DD. The database (accessed on 27 November 2020) was constructed from published literature and provides information relating to genes, variants, and phenotypes associated with DDs, including mode of inheritance and mechanism of pathogenicity. 10 We initially included all genes that had been annotated as monoallelic (i.e., autosomal dominant) with an evidence level of ''confirmed'' or ''probable'' (n ¼ 599). Further subsets of these genes were selected for sensitivity analyses, including a panel of 325 genes where variants are known to cause DD through a loss-of-function (LoF) haploinsufficiency mechanism; a more stringent panel of 125 of these genes that were significantly enriched for damaging de novo LoF mutations in a recent analysis of 31,058 DD probands; 12 and a small panel of 25 genes where deleterious variants cause clinically well-established syndromes, with >30 likely pathogenic de novo LoF mutations in the same study 12 (see Figure 1 and Table S1).

Variant selection
To investigate the penetrance of likely deleterious single-nucleotide variants (SNVs) and insertions/deletions (indels) in genes where rare variants are known to cause autosomal dominant DD, we used WES data from 200,632 individuals in UKB to identify individuals with a rare SNVs and/or indels in any of these genes. For most of our analyses, rare was defined as any variant that occurred in 5 or fewer individuals in the UKB cohort; we also investigated the effect of changing this threshold to n ¼ 1, n ¼ 10, n ¼ 50, and n ¼ 100 individuals. We included variants that had individual and variant missingness < 10%, minimum read depth of 7 for SNVs and 10 for indels, and at least one sample per site passed the allele balance threshold > 15% for SNVs and 20% for indels. We selected three functional classes of variant in canonical transcripts based on annotation in genome build GRCh38 by the Ensembl Variant Effect Predictor. 13 (1) Likely deleterious LoF variants: we defined an LoF variant as one that is predicted to cause a premature stop, a frameshift, or abolish a canonical splice site; only those variants deemed to be high confidence by the Loss-Of-Function Transcript Effect Estimator (LOFTEE) were retained (https://github.com/konradjk/ loftee). (2) Likely deleterious missense variants: missense variants with a REVEL score > 0.7. 14 A further set of likely deleterious missense variants were identified via CADD, 15 with scores of >20, >25, and >30. Individuals with variants in group 1 were excluded from groups 2 and 3; individuals with variants in group 2 were excluded from group 3. Following variant selection, one gene (DNMT3A) was removed from further analysis as the variants in this gene-which is known to be strongly linked to blood cancer 16 -had a significantly lower allele balance, suggesting substantial somatic mosaicism (see Figure S1). Other genes linked to blood cancer, such as ASXL1 and TET3, were examined but showed no difference in allele balance compared to the remainder of the LoF variants identified. LoF variants in the most stringent 25 gene subset were visually confirmed with the Integrative Genomics Viewer (IGV).
To investigate the penetrance of multigenic copy-number variants (CNVs) overlapping known DD loci, we used SNP-array data from 488,377 genotyped individuals in UKB and Figure 1. Flow diagram outlining selection process for genes in each subset that were used for analysis DDG2P, Developmental Disorders Genotype-to-Phenotype database; DD, developmental disorder; LoF, loss-of-function variants; 31K DD trios, 31,058 parent-offspring families with developmental disorders. 12 PennCNV 17 (version 1.0.4) to detect multigenic CNVs overlapping 69 published CNVs strongly associated with developmental delay. 18,19 Log R ratio (LRR) and B-allele frequency (BAF) values for 805,426 genome-wide SNP probe sets were provided by UKB, and we used an in-house script to convert these data to PennCNV input signal files. The PennCNV hidden Markov model (HMM) transition matrix was trained with 250 random UKB samples with PennCNV-train. Population Frequency B Allele reference data (PFB) were generated via 1,000 random UKB samples. We then used PennCNV-test to detect regions in a duplication or deletion state in LRR/BAF HMM with the generated PFB and transition matrix. An individual was classified as having a multigenic DD deletion or duplication if the region detected with PennCNV reciprocally intersected the published region by at least 50%. We plotted LRR/BAF data for each call in each of these regions, and carried out visual inspection of each event, and false positives and single gene CNVs were excluded. A list of included CNVs is provided in Table S2.

Statistical analysis
We performed both individual gene and gene panel burden tests across our different gene subsets. We grouped individuals into one of five groups depending upon the type of variant they carried (LoF, missense, or synonymous variant in one or more monoallelic DDG2P gene or deletion or duplication overlapping published DD multigenic CNVs). Association tests were limited to individuals in UKB with genetically defined European ancestry that were unrelated up to third-degree relationship (184,142 with WES data; 380,029 with SNP-array data) and were controlled for age, sex, recruitment center, and 40 principal components. Variant burden association tests in gene panels and multigenic CNVs were performed with STATA (version 16.0) with linear regression for continuous phenotypes and logistic regression for the binary phenotypes. Associations were tested between each group of individuals and other individuals in the UKB cohort without any of the classes of rare variation defined above. Information from HES codes, self-report questionnaires, and cognitive tests taken at recruitment was used for the phenotypic information. Associations were tested for 20 UKB phenotypes selected on the basis of their likely relevance to DDs, including the following.  Figure 2, and Figure 3). Individuals carrying these variants generally had lower cognitive performance than the rest of the cohort, with reduced fluid intelligence (LoF group beta: À1.059), slower reaction times (LoF group beta: þ0.043), lower numeric memory scores (LoF group beta: À0.068), and longer pairs matching times (LoF group beta: þ0.122). They also completed fewer years in education, left education at an earlier age, and were less likely to have a degree. Medically, individuals were more likely to have reported a mental health issue or been diagnosed with either a childhood DD (including mild-severe intellectual disability, epilepsy, autism, ADHD, and congenital malformations) or adult DD (including schizophrenia and bipolar disorder). Individuals were also more likely to be shorter, have a higher BMI, and have had fewer children (though the latter association was only significant in men). Individuals also had significant socioeconomic disadvantages: they were less likely to be employed or be able to work and had a lower income and a higher TDI. Across all phenotypes tested, we observed a trend corresponding to the likely deleteriousness of the variants; the largest effect was generally observed in the group of individuals with multigenic deletions, followed by multigenic duplications, then LoF variants, and finally missense variants in one (or more) DDG2P genes. These trends were robust to the use of different CADD thresholds for selecting of missense variants (see Table S4) and to removal of individuals with a diagnosed childhood developmental disorder (''child DD,'' as defined in material and methods, n ¼ 3,132; see Table S5). In contrast, individuals with only rare synonymous variants in these genes showed no statistically significant difference in any phenotype compared to the remainder of the cohort, as expected for likely benign variants, suggesting that most of the confounding caused by population sub-structure was appropriately controlled.
Potentially damaging LoF variants were found even in genes associated with clinically well-established developmental syndromes We repeated our association analysis with smaller, more stringent, subsets of 325, 125, and 25 DDG2P genes ( Figure 1). Interestingly, even within the most stringent subset of 25 genes where rare variants are thought to cause highly penetrant severe forms of DD via haploinsufficiency, with >30 de novo LoF mutations identified in 31,058 DD probands, 12 we were able to identify 167 individuals in UKB who had a high confidence LoF variant in one of these genes. We observed similar trends to the full 599 gene panel for LoF variants in smaller subsets of genes in which variants cause DD by haploinsufficiency: the group overall exhibited mild DD-related phenotypes, although the results were less significant because of the smaller number of individuals carrying likely LoF variants (Table S3). Nonetheless, a Bonferroni-corrected significant result was seen across all gene subsets for shorter stature, reduced chance of having a degree, and increased TDI; lower fluid intelligence, lower income, higher BMI, and an increased chance of being diagnosed with a child DD also remained nominally significant even in the 25 gene subset. We also performed single-gene burden testing but were under-powered to find any significant associations for most genes as a result of the small number of individuals and likely mild phenotypic effects in UKB. Interestingly, despite previously reaching genome-wide significance for enrichment of damaging de novo mutations, MIB1 had the largest number of individuals carrying likely LoF variants in UKB (n ¼ 260), more than the 25 most stringent genes combined, but showed no associations with any DD-related phenotypes. The gene also has almost double the number of LoF variants observed versus expected in gnomAD (https://gnomad.broadinstitute.org/ gene/MIB1), and thus appears to be remarkably unconstrained.

Rarer LoF variants have a larger phenotypic effect than more common LoF variants
We investigated the effect of allele count (AC) on the phenotypic effect of LoF variants in our largest gene panel (599 monoallelic DDG2P genes). Specifically, we performed association tests with 16 DD-related traits that were significant in the previous analysis for groups of individuals with rare LoF variants in these genes that were present in just a single individual in UKB, compared with variants seen 5, 10, 50, or 100 or fewer times ( Figure 4). The group of individuals who had the rarest variants (AC ¼ 1) had the largest phenotypic effect change compared to the rest of the cohort, although the results were generally not significant as a result of low numbers.
However, across the phenotypes tested, both the effect size and the p value decreased as the AC increased, suggesting either that more common variants have a milder effect on phenotype or that more common variants are benign and are simply diluting the effect of rare pathogenic variants. No difference was observed between the effect of LoF variants in the first and second half of genes. In addition, 295 individuals had LoF variants that were previously classified as ''likely pathogenic'' or ''pathogenic'' in ClinVar, but no significant difference was detectable in their phenotypes compared with the remainder of the group who also had LoF variants.

Discussion
We have shown that rare, potentially damaging variants in genes and loci linked to dominant monogenic DD are pre-sent in adults in UK Biobank and result in a mild developmental phenotype. Individuals carrying these variants have notably reduced cognitive abilities and a lower socioeconomic status. Gene panel association tests suggest a strong and consistent trend for increasing phenotypic effects with rarer and more damaging variants. Although it is impossible to disentangle incomplete penetrance and variable expressivity in a population study, our findings are consistent with similar studies in clinically unselected population cohorts 4,[20][21][22][23][24] showing reduced penetrance of rare damaging variants in genes where similar variants cause rare monogenic forms of DD. Moreover, our results are robust to removal of individuals diagnosed with a childhood developmental disorder, suggesting that fully penetrant individuals are not driving the signal. We note that the variants identified in UKB are not necessarily the same ones that have been identified previously in clinical cases, and indeed very few of those we identified had previously been annotated in ClinVar. 25 We also note that our dataset most likely includes some predicted-LoF variants that do not actually result in a loss of function (either because of technical false positives or biological rescue through translation re-initiation, alternative splicing, etc.). Nonetheless, these issues are common to any clinical or research scenario where variants are prioritized from WES data, and our findings were robust when limited to likely LoF variants in a subset of 325 genes in which rare variants cause DD via a haploinsufficiency mechanism. The fact that our findings are robust to smaller, more stringent subsets of genes also suggests that the low effect sizes cannot simply be explained by a subset of variants in low penetrance (or non-causal) genes. Furthermore, rare predicted-LoF variants were found in individuals in genes in which similar variants are thought to be fully or nearly fully penetrant causes of very well-established developmental syndromes, but without the full clinical phenotype that would be expected, suggesting that there is a range of penetrance and expressivity in the general population.
Despite the large size of UKB, we were limited by the number of individuals of European ancestry carrying rare damaging variants in these genes, which meant some of our analyses were under-powered to show a significant effect. We were also limited by the clinical and phenotypic data available on these individuals, all of whom were over 40 years of age at recruitment; evaluation and diagnosis of DD was much less routine when these individuals were children and is less likely to be recorded in the HES codes of older adults. Nonetheless, when found in an appropriate clinical pediatric setting, rare damaging variants in these genes are widely considered diagnostic for DD and thus they might not be expected to be present in a population cohort. Our results suggest that, although the penetrance of variants across these genes is lower than would be expected from previous clinical studies, they do still exert a phenotypic effect on adults in the general population who are nonetheless healthy enough, and have sufficient capacity, to volunteer to participate in a biobank.
Variants that cause monogenic DD have historically been identified almost exclusively through clinical cohorts of affected children and families, and their effect on adults in the general population has not previously been evaluated. While clinical studies may overestimate the penetrance of such rare variants, population cohorts like UKB are likely to underestimate the penetrance as a result of ascertainment bias toward healthy individuals. 26 The penetrance and expressivity of variants in these genes could be affected by a number of different modifiers, including genetic variants in other genes, regulatory variants affecting gene expression, somatic mosaicism, and accumulated environmental factors. 5 The latter is particularly relevant when considering the effect of damaging variants in DDG2P genes on adults. It is interesting to note that, unlike most traits, the heritability of intelligence (i.e., general cognitive ability) increases dramatically with age, 27 suggesting a major role for gene-environment interactions as individuals become better able to select, modify, and optimize their environment. Further research is needed into the penetrance of rare, damaging variants in the general population using larger datasets, which may allow modifiers to be investigated to help explain why some individuals are more severely affected by particular genetic conditions than others.

Data and code availability
Original source data from UK Biobank is available on application from https://www.ukbiobank.ac.uk/. STATA regression analysis code is provided in supplemental information.